Sparse Principal Component of a Rank- deficient 

Matrix 



Megasthenis Asteris and Dimitris S. Papailiopoulos 
Department of Electrical Engineering 
University of Southern California 
Los Angeles, CA 90089 USA 
Email: {asteris, papailio}@usc . edu 



George N. Karystinos 
Department of Electronic and Computer Engineering 
Technical University of Crete 
Chania, 73100, Greece 
Email: karystinos@telecom.tuc.gr 



Abstract — We consider the problem of identifying the sparse 
principal component of a rank-deficient matrix. We introduce 
auxiliary spherical variables and prove that there exists a set 
of candidate index-sets (that is, sets of indices to the nonzero 
elements of the vector argument) whose size is polynomially 
bounded, in terms of rank, and contains the optimal index- 
set, i.e. the index-set of the nonzero elements of the optimal 
solution. Finally, we develop an algorithm that computes the 
optimal sparse principal component in polynomial time for any 
sparsity degree. 

I. Introduction 

Principal component analysis (PCA) is a well studied, popu- 
lar tool used for dimensionality reduction and low-dimensional 
representation of data with applications spanning many fields 
of science and engineering. Principal components (PC) of a 
set of "observations" on some N variables capture orthog- 
onal directions of maximum variance and offer a Euclidean- 
distance-optimal, low-dimensional visualization that -for many 
purposes- conveys sufficient amount of information. Without 
additional constraints, the PCs of a data set can be computed 
in polynomial time in N, using the eigenvalue decomposition. 

One disadvantage of the classical PCA is that, in general, the 
extracted eigenvectors are expected to have nonzero elements 
in all their entries. However, in many applications sparse 
vectors that maximize variance are more favorable. Sparse PCs 
can be less complicated to interpret, easier to compress, and 
cheaper to store. Thus, if the application requires it, then some 
of the maximum variance property of a PC may be slightly 
traded for sparsity. To mitigate the fact that PCA is oblivious 
to sparsity requirements, an additional cardinality constraint 
needs to be introduced to the initial variance maximization 
objective. The sparsity aware flavor of PCA, termed sparse 
PCA, inarguably comes at a higher cost: sparse PCA is an 
NP-Hard problem [9]. 

To approximate sparse PCA various methods have been 
introduced in the literature. Initially, factor rotation techniques 
that extract sparse PCs were used in [7], [4]. Straightforward 
thresholding of PCs was presented in [3] as a computationally 
light means for obtaining sparsity. Then, a modified PCA 
technique based on the LASSO was introduced in [5]. In [13] 
an elaborate nonconvex regression-type optimization approach 
combined with LASSO penalty was used to approximately 



tackle the problem. A nonconvex technique, locally solv- 
ing difference-of-convex-functions programs was presented 
in [12]. Semidefinite programming (SDP) was used in [1], 
[14], while [15] augmented the SDP approach with an extra 
greedy step that offers favorable optimality guarantees under 
certain sufficient conditions. The authors of [10] considered 
greedy and branch-and-bound approaches, further explored in 
[11]. Generalized power methods using convex programs were 
also used to approximately solve sparse PCA [6]. A sparse- 
adjusted deflation procedure was introduced in [8] and in 
[2] optimality guarantees were shown for specific types of 
covariance matrices under thresholding and SDP relaxations. 

Our Contribution: In this work we prove that the sparse 
principal component of a matrix C can be obtained in poly- 
nomial time under a new sufficient condition: when C can be 
written as a sum of a scaled identity matrix plus an update, 
i.e. C = ctIat + A, and the rank of the update A is not a 
function of the problem size. 1 Under this condition, we show 
that sparse PCA is polynomially solvable, with the exponent 
being only a linear function of the rank. This result is possible 
after introducing auxiliary spherical variables that "unlock" 
the low-rank structure of A. The low-rank property along with 
the auxiliary variables enable us to scan a constant dimensional 
space and identify a polynomial number of candidate sparse 
vectors. Interestingly, we can show that the optimal vector 
always lies among these candidates and a polynomial time 
search can always retrieve it. 

II. Problem Statement 

We are interested in the computation of the real, unit-length, 
and at most iT-sparse principal component of the N x N 
nonnegative definite matrix C, i.e. 

x opt = arg max x T Cx (1) 

xes» 

where = {xeM": ||x|| = 1, card(x) < K }. Interest- 
ingly, when C can be decomposed as a low-rank update of a 
constant identity matrix, i.e. C = crljv + A where a e R, Ijv 
is the N x N identity, and A is a nonnegative definite matrix 

'if a = 0, then we simply have a low-rank matrix C. 



with rank D, then x T Cx = erj|x|| 2 + x T Ax. Therefore, the 
optimization (1) can always be rewritten as 



x opt = arg max x T Ax 



(2) 



where the new matrix A has rank D. Since A is a nonnegative 
definite matrix, it can be decomposed as A = VV T where 



A 



V = [vi v 2 



V£>] and problem (1) can be written as 



x opt = arg max x VV x = arg max II V x| 



(3) 



In the following, we show that when D is not a function of 
N, (1) can be solved in time O (poly(iV)). 

III. Rank-1 and Rank-2 Optimal Solutions 

Prior to presenting the main result for the general rank D 
case, in this section we provide insights as to why sparse PCA 
of rank deficient matrices can be solved in polynomial time, 
along with the first nontrivial case of polynomial solvability 
for rank-2 matrices A. 

A. Rank-1: A motivating example 

In this case, A has rank 1, V = v, and (3) becomes 



c opt = arg max v x 

xes« 



arg max 



N 

E 



(4) 



It is trivial to observe that maximizing (4) can be done by 
distributing the K nonzero loadings of x to the K absolutely 
largest values of v, indexed by set /. Then, the optimal 
solution x opt has the following K nonzero loadings 

- V/ (*\ 

x opt.I n II ■ W) 

l|v/|| 

where x op t,/ denotes the set of elements of x opt indexed by I 
and x pt,i = for all i ^ I. 

The leading complexity term of this solution is determined 
by the search for the K largest element of v, which can 
be done in time O(N) [16]. Therefore, the rank-1 -optimal 
solution can be attained in time that is linear in N. 

B. Rank-2: Introducing spherical variables 

In this case, V is an N x 2 matrix. A key tool used in 
our subsequent developments is a set of auxiliary spherical 
variables. For the rank-2 case, we introduce a single phase 
variable </> e (— §, and define the polar vector 



c(0) = 



sin ( 

COS ( 



(6) 



which lies on the surface of a radius 1 circle. Then, from 
Cauchy-Schwartz Inequality we obtain 



|c T (^)V T x|<||c(^)||||V T x|| 



|V J x|| 



(7) 



with equality if and only if c(<j>) is parallel to V T x. Therefore, 
finding x that maximizes ||V T x|| in (3) is equivalent to 
obtaining x of the (x, <fi) pair that maximizes |c T (</>)V T x|. 
Initially, this rewriting of the problem might seem unmoti- 
vated, however in the following we show that the use of c(<f>) 



unlocks the low-rank structure of V and allows us to compute 
the sparse PC of A by solving a polynomial number of rank-1 
instances of the problem. 

We continue by restating the problem in (3) as 



max 

xGS" 



|V T ? 



max max c (0)V x 

xGS£0G(-f,f] 1 1 

max max | c T (d))V T x| 
0e(-f,f]xGS« v ^ — - 



If we fix <fr, then the internal maximization problem 

max |v T (0)x| 

xes™ 



(8) 



(9) 



is a rank-1 instance, for which we can determine in time 0(N) 
the optimal index-set /(</>) corresponding to the indices of the 
K absolutely largest elements of v(0) = Vc(</>). 

However, why should (f> simplify the computation of a 
solution? The intuition behind the polar vector concept is that 
every element of 

/ V^i sin0 + Vi.2 cos0 \ 



Vc($ = 



(10) 



\Vat,i sin</> + Vn,2 cos 4>J 

is actually a continuous function of 0, i.e. a curve in <j>. 
Hence, the K absolutely largest elements of Vc(</>) at a 
given point (f> are functions of (f>. Due to the continuity of 
the curves, we expect that the index set I((f>) will retain the 
same elements in an area "around" </>. Therefore, we expect 
the formation of regions, or "cells" on the <fr domain, within 
which the indices of the K absolutely largest elements of 
v(0) remain unaltered. A sorting (i.e., an /-set) might change 
when the sorting of the amplitudes of two element in v(0) 
changes. This occurs at points (f>, where these two absolute 
values become equal, that is, points where two curves intersect. 
Finding all these intersection points, is sufficient to determine 
cells and construct all possible candidate /-sets. Among all 
candidate /-sets, lies the set of indices corresponding to the 
optimal X-sparse PC. Exhaustively checking the /-sets of 
all cells, suffices to retrieve the optimal. Surprisingly, the 
number of these cells is exactly equal to number of possible 
intersections among the amplitudes of v(^), which is exactly 
equal to 2(^) = O (N 2 ), counting all possible combinations 
of element pairs and sign changes. 

Before we proceed, in Fig. 1 we illustrate the cell partition- 
ing of the <f> domain, where we set N = 4 and K — 2 and 
plot the magnitudes of the 4 curves that originate from the 
4 rows of Vc((f>). Cells (intervals) are formed, within which 
the sorting of the curves does not change. The borders of 
cells are denoted by vertical dashed lines at points of curve 
intersections. Our approach creates 12 cells which exceeds the 
total number of possible index-sets, however this is not true for 
greater values of N. Moreover, we use Ri regions to denote 
the sorting changes with respect to only the /C-largest curves. 
These regions is an interesting feature that might yet decrease 
the number of cells we need to check. However, due to lack 
of space we are not exploiting this interesting feature here. 




Fig. 1. N = 4, K = 2, rank-2 case: Cells on the <f> domain. 

C. Algorithmic Developments and Complexity 

Our goal is the construction of all possible candidate K- 
sparse vectors, determined by the index-sets of each cell on 
the <\> domain. This is a two step process. First we need to 
identify cell borders, and then we have to determine the index 
sets associated with these cells. 

Algorithmic Steps: We first determine all possible inter- 
sections of curve pairs in v(0). Any pair {i,j} of distinct 
elements in v(4>) is associated with two intersections: Vi(fa) — 
Vj(<p) and Vi(4>) — —Vj(<j>). Solving these two equations with 
respect to (j>, determines a possible point where a new sorting 
of the /T-largest values of v((f>) might occur. 

Observe that at an intersection point, the two values 
Vi ((/>), Vj(4>) are absolutely the same. Exactly on the point 
of intersection, all but 2 (the ith and jth) coordinates of a 
candidate K sparse vector can be determined by solving a 
rank 1 instance of the problem. However, we are left with 
ambiguity with respect to 2 coordinates i and j that needs 
to be resolved. To resolve this ambiguity, we can visit the 
"outermost" point of the <f> domain, that is, |. There, due to 
the continuity of Vi(fa and ±Vj(<fi), the sortings within the two 
cells defined by the two intersections, will be the identical, 



or opposite sortings of Vi 



and ±Vi 



depending on 



whether Uj(0) and ±Vj(fa are both positive, or negative at 
the intersection point, respectively. 

Having described how to resolve ambiguities, we have fully 
described a way to calculate the /-set at any intersection point. 
Apparently, all intersection points, that is, all (^) pairwise 
combinations of elements in v(0), have to be examined to 
yield a corresponding /-set. 

Computational Complexity: A single intersection point can 
be computed in time 0(1). At a point of intersection, we have 
to determine the K-th order element of the absolute values of 
v((j>) and the K — 1 elements larger than that, which can be 
done in time O(N). Resolving an ambiguity costs 0(1). So 
in total, finding a single /-set costs O(N). Constructing all 
candidate /-sets requires examining all 2(^) points, implying 
a total construction cost of 2(^) x 0{N) = (iV 3 ). 



IV. The Rank-/? Optimal Solution 

In the general case, V is a N x D matrix. In this section 
we present our main result where we prove that the problem 
of identifying the K -sparse principal component of a rank-/? 
matrix is polynomially solvable if the rank D is not a function 
of N, The statement is true for any value of K (that is, even 
if K is a function of N). Our result is presented in the form 
of the following proposition. The rest of the section contains 
a constructive proof of the proposition. 

Proposition 1: Consider a N x N matrix C that can be 
written as a rank-/) update of the identity matrix, that is, 



C = (Tl 



N 



(ID 



where a £ R and A is a rank-/) symmetric positive semidef- 
inite matrix. Then, for any K — 1, 2, . . . , N, the /C-sparse 
principal component of A that maximizes 



c T Cx 



(12) 



subject to the constraints ||x|| = 1 and card(x) < K can be 
obtained with complexity 0{N D+1 ). □ 
We begin our constructive proof by introducing the spherical 
coordinates fa, <f>2, ■ . ■ , 4>d-i <= ( — f, f] an d defining the 
spherical coordinate vector 



the hyperpolar vector 



(13) 



c(0 1: 



A 



D-li 



COS 



smfa 
cos 4>± sin fa 
cos fa sin 03 



cos fa cos 02 • 
cos fa cos fa . 



. sm0 D _i 

. COS faj-i 



(14) 



and set <P = (— f ; fl- Then, similarly with (8) and due to 
Cauchy-Schwartz Inequality, our optimization problem in (3) 
is restated as 



max ||V T x| 



max max x Vc((4,.n_ 1 ) . (15) 



Hence, to find x that maximizes ||V T x|| in (3), we can equiv- 
alently find the (x, fa) pair that maximizes |x T Vc((/> 1 . j:) _ 1 )|. 
We interchange the maximizations in (15) and obtain 



max V J x 



max ( max |x T Vc(<A,. n _ 1 ) 

v '(0i : d-i) 



(16) 

For a given point <p 1 . D _ 1 , V c(<p 1 . D _ 1 ) is a fixed vector and 
the internal maximization problem 



max x Vc(0 1:jD _ 1 ) = max x v (fa. D -i, 



(17) 



is a rank-1 instance. That is, for any given point 4> 1 . D _ 1 , 
we can determine the optimal set /(0> 1 . D _ 1 ) of the nonzero 
elements of x as the set of the indices of the K largest 
elements of vector |Vc((/) 1 . £ ,_ 1 )|. 



To gain some intuition into the purpose of inserting 
the second variable cf> 1 . D _ 1 , notice that every element of 
±Vc(4> 1 . D _ 1 ) is actually a continuous function of cf) 1:D _ 1 , 
a D-dimensional hypersurface and so are the elements of 
|Vc(0 1:£ >_i)|. When we sort the elements of \Vc(4> 1:D _ 1 )\ 
at a given point 4> 1 . D _ 1 , we actually sort the hypersurfaces 
at point 4>\-.d-i according to their magnitude. The key ob- 
servation in our algorithm, is that due to the continuity of 
the hypersurfaces in the $ D_1 hypercube, we expect that in 
an area "around" 4> 1 . D _ 1 the hypersurfaces will retain their 
magnitude-sorting. So we expect the formation of cells in the 
$ D_1 hypercube, within which the magnitude-sorting of the 
hypersurfaces will remain unaltered, irrespectively of whether 
the magnitude of each hypersurface changes. Moreover, even if 
the sorting of the hypersurfaces changes at some point around 
4>i;D-i ^ i s possible that the I does not change. So we expect 
the formation of regions in the hypercube which expand 

over more than one cells and within which the /-set remains 
unaltered, even if the sorting of the hypersurfaces changes. 
If we can efficiently determine all these cells (or even better 
regions) and obtain the corresponding /-sets, then the set of all 
candidate index-sets may be significantly smaller than the set 
of all (^) possible index-sets. Once all the candidate /-sets 
have been collected, / opt and x opt will be determined through 
exhaustive search among the candidate sets. 

In (17) we observed that at a given point ^v.d-i tne maxi- 
mization problem resembles the rank-1 case and consequently, 
the /-set at 4>\.£>_\ consists of the indices of the K largest 
elements of \Vc(4> 1 . D _ 1 )\. Motivated by this observation, we 
define a labeling function /(•) that maps a point 4> 1 . D _ 1 to an 
index-set 

/(Vjv x d;0i : d_i) = argmax^KVc^!.^!)).!. (18) 

iei 

Then, each point 4> 1 . D _ 1 e ^ D ^ 1 is mapped to a candidate 
index-set and the optimal index-set / opt belongs to 

XtottVjvxB) = (J {I(Vn*d; 4>i-.d-i)}- (19) 



In the following, we (i) show 
number of candidate index-sets is 



that 



the 

d)\ 



total 

< 



,D-l-2d 



0(N D ) and 



/ N \(D-2d 

h \D-u)\\%\-d) 

(ii) develop an algorithm for the construction of I(Vnxd) 
with complexity 0(N D+1 ). 

The labeling function is based on pair-wise comparisons 
of the elements of Vc(<p 1 . D _ 1 ) while each element of 
I Vjvx£> c (0i:D-i)l i s a continuous function of 4> 1 . D _ l , a D- 
dimensional hypersurface, and any point </>!•£>„! is mapped 
to an index-set / which is determined by comparing the 
magnitudes of these hypersurfaces at <f> 1 . D _ 1 . Due to the 
continuity of hypersurfaces, the index-set / does not change 
in the "neighborhood" of 4>\-.d-v A necessary condition for 
the / set to change is two of the hypersurfaces to change their 
magnitude ordering. The switching occurs at the intersection 



of two hypersurfaces where we have | (Vc(<p 1 . D _ 1 )^j . | = 
| (Vc(^) 1:D _ 1 )) j |, % ^ j, which yields 



tan 



:V,, ;: /; ■ V,,.. : /.l '««/>,:/, ,,' 
Vi,i T Vj,i 



Functions (f>\ = tan 1 (— ^ v '- 2; - p 



'3,1 



(20) 



and 



' i = tan" 1 (- (Vi - 2:D+ vf 1 2 +^'° (02!g - l) ) determine (£>-!)- 



-Vi,= 



dimensional hypersurfaces S(Vi t - ; Vj i: ) and S(Vi 
respectively. Each hypersurface partitions into two 

regions. 

For convenience, in the following we use a pair {i,j} to 
denote the rows of matrix V^xD, that originate hypersurface 
S ^-py V|j| ; . ; Moreover, we allow i and j to be 

negative in order to encapsulate the information about the 
sign with which each row participates in the generation of 
hypersurface S, i.e. 



{i, J}^S( -.Vu 



-!i| Vi - : 



(21) 



where i,j e {—N, . .. , —1, 1, . . . , N}, 

Let {ii,i 2 , ■ ■ ■ ,%d) C {1,2, . . . ,N} where ...,id} 
is one among the (^) size-/? subsets of {1,2,..., N}. 
Then, by keeping i\ fixed (where i-y is arbitrarily selected, 
say i\ is the minimum among i\, i 2 , . . . , Id) and assign- 
ing signs to i2,«3, ■•■,«£) we can generate 2 D ~ 1 sets of 
the form ±i 2 , ■ . ■ , ±Id}- Hence, we can create totally 
(^)2 £)_1 such sets which we call Ji, J 2 , . . . , J/w\ 2D _i. 
We can show (the proof is omitted due to lack of space) 
that, for any I = 1, 2, . . . , (£)2 £> - 1 , the (£) hypersurfaces 

s (iff v i»iv ; iff v b'i. : ) that we obtain for i^j} c J i have a 

single common intersection point 4>{Ji) which "leads" at most 
(l Si) cells. Each such cell is associated with an index-set in 

the sense that /(V nxd', <Pi-.d-i) i s maintained for all <pv.D-i 
in the cell. In other words, the /-set associated with all points 
&1-.D-1 m tne interior of the cell is the same as the /-set at the 
leading vertex. In fact, the actual sorting of |Vc(0 1 .£ ) _ 1 )| for 
all points in the interior of the cell is the same as the sorting 
at the leading vertex and the /-set may characterize a greater 
area that includes many cells. In addition, we can show that 
examination of all such cells is sufficient for the computation 
of all index-sets that appear in the partition of and have 

a leading vertex. 

We collect all index-sets into I{V NxD ) and observe that 
2^(Vat X £)) can only be a subset of the set of all possible (^) 
index-sets. In addition, since the cells are defined by a leading 
vertex, we conclude that there are at most (^) j)2 £l_1 
cells. We finally note that there exist cells that are not associ- 
ated with an intersection-vertex. We can show that such cells 
can be ignored unless they are defined when 4>d-2 = f . In the 
latter case, we just have to identify the cells that are determined 
by the reduced-size matrix V nx(d-2) over me hypercube 
$ D - 3 . Hence, X t ot(VjvxD) = I(Vjvxd) U 2: to t(Vjv X (D-2)) 



and, by induction, 

^totO^Nxd) = Z(VNxd) UXtot(Vjv x (d-2)) ; 3 < d < D, 
which implies that 

Ztot(VArxD) 

= 1(Vnxd)U1(V Nx{d _ 2) ) U . . .UI(V NX(D _ 2LVJ) ) 
= |J l(V Nx(D _ 2d) ). (22) 

d=0 

As a result, the cardinality of X tot (V nxd) is 

\Itot(V NxD )\ 

< \I(Vnxd)\ + \1(V Nx(d _ 2) )\ + 

V y D - 2 



l T ( V iVx(D-2L° 5 



< 



( D )([fj 



N 

2 L ^ 



D-3 



E ( 



d=0 



JV 
D - 2d 



D _l_2L°fij 



: 0(N D ). 



(23) 



It remains to show how Z(Vtvxd) is constructed. 

As already mentioned, there are in total (^)2 £)_1 inter- 
section points which can all be blindly examined. For any 
I = 1,2, . . . , (^)2 D -\ the cell leading vertex </>(J ; ) is 
computed efficiently as the intersection of D— 1 hypersurfaces, 
i.e. the unique solution of 

/ V,.. ; :/) : V,.. V:D \ 



c{<j> 



l-.D-l) 







(D-l)xli 



(24) 



which is obtained in time 0(1) with respect to N. After 
4>{ Ji) has been computed, we have to identify the index-sets 
associated with the cells that originate at <fi(Ji) by calling the 
labeling function I(Vn x d 4>{Ji)) which marks the K ele- 
ments of the /-set, i.e. the indices of the K largest elements of 
|Vc(0( ,h))\, 1 = 1,2,..., (^)2 D - 1 . Since 0( J t ) constitutes 
the intersection of D hypersurfaces that correspond to the D 
elements of 4>(Ji), the corresponding values in \Vc(<fi(Ji))\ 
equal each other. If the K largest values in |Vc(<^(Jj))| 
contain Di values that correspond to the D elements of J;, 
then we can blindly examine all (^) cases of index-sets to 
guarantee that all actual index-sets are included. The term ) 
is maximized for Di = [^J, hence at most index-sets 

correspond to intersection point (f>(Ji). 

The complexity to build I(Vtvxd) results from the parallel 
examination of (^)2 £)_1 intersection points while worst- 
case complexity O(N) + is required at each point 
for the identification of the corresponding index-sets. The 
term O(N) corresponds to the cost required to determine 
the ifth-order element of |Vc(0(J;))| in an unsorted array. 
Consequently, the worst-case complexity to build I(Vjv x d) 
becomes Q^ 15 " 1 x (o(N) + ( Jj)) = 0(N D+1 ). Finally, 



we recall that the cardinality of I to t(^NxD) is upper bounded 
by 0(N D ) and conclude that the overall complexity of our 
algorithm for the evaluation of the sparse principal component 
of VV T is upper bounded by 0{N D+l ). 

V. Conclusions 

We considered the problem of identifying the sparse prin- 
cipal component of a rank-deficient matrix. We introduced 
auxiliary spherical variables and proved that there exists a set 
of candidate index-sets whose size is polynomially bounded, 
in terms of rank, and contains the optimal index-set, i.e. the 
index-set of the nonzero elements of the optimal solution. 
Finally, we developed an algorithm that computes the optimal 
sparse principal component in polynomial time. Our proposed 
algorithm stands as a constructive proof that the computation 
of the sparse principal component of a rank-deficient matrix 
is a polynomially solvable problem. 
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